Búsqueda | Portal Regional de la BVS

1.

Fast and powerful conditional randomization testing via distillation.

Liu, Molei; Katsevich, Eugene; Janson, Lucas; Ramdas, Aaditya.

Biometrika ; 109(2): 277-293, 2022 Jun.

Artículo en Inglés | MEDLINE | ID: mdl-37416628

RESUMEN

We consider the problem of conditional independence testing: given a response Y and covariates (X,Z), we test the null hypothesis that Yâ««Xâ£Z. The conditional randomization test was recently proposed as a way to use distributional information about Xâ£Z to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about Yâ£(X,Z). This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, like screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing conditional randomization test implementations, but requires orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.

2.

A statistical model for describing and simulating microbial community profiles.

Ma, Siyuan; Ren, Boyu; Mallick, Himel; Moon, Yo Sup; Schwager, Emma; Maharjan, Sagun; Tickle, Timothy L; Lu, Yiren; Carmody, Rachel N; Franzosa, Eric A; Janson, Lucas; Huttenhower, Curtis.

PLoS Comput Biol ; 17(9): e1008913, 2021 09.

Artículo en Inglés | MEDLINE | ID: mdl-34516542

RESUMEN

Many methods have been developed for statistical analysis of microbial community profiles, but due to the complex nature of typical microbiome measurements (e.g. sparsity, zero-inflation, non-independence, and compositionality) and of the associated underlying biology, it is difficult to compare or evaluate such methods within a single systematic framework. To address this challenge, we developed SparseDOSSA (Sparse Data Observations for the Simulation of Synthetic Abundances): a statistical model of microbial ecological population structure, which can be used to parameterize real-world microbial community profiles and to simulate new, realistic profiles of known structure for methods evaluation. Specifically, SparseDOSSA's model captures marginal microbial feature abundances as a zero-inflated log-normal distribution, with additional model components for absolute cell counts and the sequence read generation process, microbe-microbe, and microbe-environment interactions. Together, these allow fully known covariance structure between synthetic features (i.e. "taxa") or between features and "phenotypes" to be simulated for method benchmarking. Here, we demonstrate SparseDOSSA's performance for 1) accurately modeling human-associated microbial population profiles; 2) generating synthetic communities with controlled population and ecological structures; 3) spiking-in true positive synthetic associations to benchmark analysis methods; and 4) recapitulating an end-to-end mouse microbiome feeding experiment. Together, these represent the most common analysis types in assessment of real microbial community environmental and epidemiological statistics, thus demonstrating SparseDOSSA's utility as a general-purpose aid for modeling communities and evaluating quantitative methods. An open-source implementation is available at http://huttenhower.sph.harvard.edu/sparsedossa2.

Asunto(s)

Microbiota , Modelos Estadísticos , Algoritmos , Benchmarking , Biología Computacional/métodos , Simulación por Computador

3.

Statistical Inference with M-Estimators on Adaptively Collected Data.

Zhang, Kelly W; Janson, Lucas; Murphy, Susan A.

Adv Neural Inf Process Syst ; 34: 7460-7471, 2021 Dec.

Artículo en Inglés | MEDLINE | ID: mdl-35757490

RESUMEN

Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators-which includes estimators based on empirical risk minimization as well as maximum likelihood-on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.

4.

Inference for Batched Bandits.

Zhang, Kelly W; Janson, Lucas; Murphy, Susan A.

Adv Neural Inf Process Syst ; 33: 9818-9829, 2020 Dec.

Artículo en Inglés | MEDLINE | ID: mdl-35002190

RESUMEN

As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.

5.

EigenPrism: inference for high dimensional signal-to-noise ratios.

Janson, Lucas; Barber, Rina Foygel; Candès, Emmanuel.

J R Stat Soc Series B Stat Methodol ; 79(4): 1037-1065, 2017 Sep.

Artículo en Inglés | MEDLINE | ID: mdl-29104447

RESUMEN

Consider the following three important problems in statistical inference, namely, constructing confidence intervals for (1) the error of a high-dimensional (p > n) regression estimator, (2) the linear regression noise level, and (3) the genetic signal-to-noise ratio of a continuous-valued trait (related to the heritability). All three problems turn out to be closely related to the little-studied problem of performing inference on the [Formula: see text]-norm of the signal in high-dimensional linear regression. We derive a novel procedure for this, which is asymptotically correct when the covariates are multivariate Gaussian and produces valid confidence intervals in finite samples as well. The procedure, called EigenPrism, is computationally fast and makes no assumptions on coefficient sparsity or knowledge of the noise level. We investigate the width of the EigenPrism confidence intervals, including a comparison with a Bayesian setting in which our interval is just 5% wider than the Bayes credible interval. We are then able to unify the three aforementioned problems by showing that the EigenPrism procedure with only minor modifications is able to make important contributions to all three. We also investigate the robustness of coverage and find that the method applies in practice and in finite samples much more widely than just the case of multivariate Gaussian covariates. Finally, we apply EigenPrism to a genetic dataset to estimate the genetic signal-to-noise ratio for a number of continuous phenotypes.

6.

Predicting patient 'cost blooms' in Denmark: a longitudinal population-based study.

Tamang, Suzanne; Milstein, Arnold; Sørensen, Henrik Toft; Pedersen, Lars; Mackey, Lester; Betterton, Jean-Raymond; Janson, Lucas; Shah, Nigam.

BMJ Open ; 7(1): e011580, 2017 01 11.

Artículo en Inglés | MEDLINE | ID: mdl-28077408

RESUMEN

OBJECTIVES: To compare the ability of standard versus enhanced models to predict future high-cost patients, especially those who move from a lower to the upper decile of per capita healthcare expenditures within 1 year-that is, 'cost bloomers'. DESIGN: We developed alternative models to predict being in the upper decile of healthcare expenditures in year 2 of a sample, based on data from year 1. Our 6 alternative models ranged from a standard cost-prediction model with 4 variables (ie, traditional model features), to our largest enhanced model with 1053 non-traditional model features. To quantify any increases in predictive power that enhanced models achieved over standard tools, we compared the prospective predictive performance of each model. PARTICIPANTS AND SETTING: We used the population of Western Denmark between 2004 and 2011 (2â146â801 individuals) to predict future high-cost patients and characterise high-cost patient subgroups. Using the most recent 2-year period (2010-2011) for model evaluation, our whole-population model used a cohort of 1â557â950 individuals with a full year of active residency in year 1 (2010). Our cost-bloom model excluded the 155â795 individuals who were already high cost at the population level in year 1, resulting in 1â402â155 individuals for prediction of cost bloomers in year 2 (2011). PRIMARY OUTCOME MEASURES: Using unseen data from a future year, we evaluated each model's prospective predictive performance by calculating the ratio of predicted high-cost patient expenditures to the actual high-cost patient expenditures in Year 2-that is, cost capture. RESULTS: Our best enhanced model achieved a 21% and 30% improvement in cost capture over a standard diagnosis-based model for predicting population-level high-cost patients and cost bloomers, respectively. CONCLUSIONS: In combination with modern statistical learning methods for analysing large data sets, models enhanced with a large and diverse set of features led to better performance-especially for predicting future cost bloomers.

Asunto(s)

Costos de la Atención en Salud , Gastos en Salud , Seguro de Salud/estadística & datos numéricos , Dinamarca/epidemiología , Femenino , Costos de la Atención en Salud/estadística & datos numéricos , Encuestas de Atención de la Salud , Gastos en Salud/estadística & datos numéricos , Humanos , Estudios Longitudinales , Masculino , Modelos Econométricos , Ajuste de Riesgo , Revisión de Utilización de Recursos

7.

Optimal Sampling-Based Motion Planning under Differential Constraints: the Driftless Case.

Schmerling, Edward; Janson, Lucas; Pavone, Marco.

IEEE Int Conf Robot Autom ; 2015: 2368-2375, 2015 May.

Artículo en Inglés | MEDLINE | ID: mdl-26618041

RESUMEN

Motion planning under differential constraints is a classic problem in robotics. To date, the state of the art is represented by sampling-based techniques, with the Rapidly-exploring Random Tree algorithm as a leading example. Yet, the problem is still open in many aspects, including guarantees on the quality of the obtained solution. In this paper we provide a thorough theoretical framework to assess optimality guarantees of sampling-based algorithms for planning under differential constraints. We exploit this framework to design and analyze two novel sampling-based algorithms that are guaranteed to converge, as the number of samples increases, to an optimal solution (namely, the Differential Probabilistic RoadMap algorithm and the Differential Fast Marching Tree algorithm). Our focus is on driftless control-affine dynamical models, which accurately model a large class of robotic systems. In this paper we use the notion of convergence in probability (as opposed to convergence almost surely): the extra mathematical flexibility of this approach yields convergence rate bounds - a first in the field of optimal sampling-based motion planning under differential constraints. Numerical experiments corroborating our theoretical results are presented and discussed.

8.

Sarcoma Resection With and Without Vascular Reconstruction: A Matched Case-control Study.

Poultsides, George A; Tran, Thuy B; Zambrano, Eduardo; Janson, Lucas; Mohler, David G; Mell, Matthew W; Avedian, Raffi S; Visser, Brendan C; Lee, Jason T; Ganjoo, Kristen; Harris, E John; Norton, Jeffrey A.

Ann Surg ; 262(4): 632-40, 2015 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-26366542

RESUMEN

OBJECTIVE: To examine the impact of major vascular resection on sarcoma resection outcomes. SUMMARY BACKGROUND DATA: En bloc resection and reconstruction of involved vessels is being increasingly performed during sarcoma surgery; however, the perioperative and oncologic outcomes of this strategy are not well described. METHODS: Patients undergoing sarcoma resection with (VASC) and without (NO-VASC) vascular reconstruction were 1:2 matched on anatomic site, histology, grade, size, synchronous metastasis, and primary (vs. repeat) resection. R2 resections were excluded. Endpoints included perioperative morbidity, mortality, local recurrence, and survival. RESULTS: From 2000 to 2014, 50 sarcoma patients underwent VASC resection. These were matched with 100 NO-VASC patients having similar clinicopathologic characteristics. The rates of any complication (74% vs. 44%, Pâ=â0.002), grade 3 or higher complication (38% vs. 18%, Pâ=â0.024), and transfusion (66% vs. 33%, Pâ<â0.001) were all more common in the VASC group. Thirty-day (2% vs. 0%, Pâ=â0.30) or 90-day mortality (6% vs. 2%, Pâ=â0.24) were not significantly higher. Local recurrence (5-year, 51% vs. 54%, Pâ=â0.11) and overall survival after resection (5-year, 59% vs. 53%, Pâ=â0.67) were similar between the 2 groups. Within the VASC group, overall survival was not affected by the type of vessel involved (artery vs. vein) or the presence of histology-proven vessel wall invasion. CONCLUSIONS: Vascular resection and reconstruction during sarcoma resection significantly increases perioperative morbidity and requires meticulous preoperative multidisciplinary planning. However, the oncologic outcome appears equivalent to cases without major vascular involvement. The anticipated need for vascular resection and reconstruction should not be a contraindication to sarcoma resection.

Asunto(s)

Sarcoma/cirugía , Injerto Vascular , Adolescente , Adulto , Anciano , Anciano de 80 o más Años , Niño , Femenino , Estudios de Seguimiento , Humanos , Masculino , Persona de Mediana Edad , Recurrencia Local de Neoplasia/etiología , Complicaciones Posoperatorias/etiología , Estudios Retrospectivos , Sarcoma/mortalidad , Análisis de Supervivencia , Resultado del Tratamiento , Injerto Vascular/métodos , Adulto Joven

9.

Number of Lymph Nodes Removed and Survival after Gastric Cancer Resection: An Analysis from the US Gastric Cancer Collaborative.

Gholami, Sepideh; Janson, Lucas; Worhunsky, David J; Tran, Thuy B; Squires, Malcolm Hart; Jin, Linda X; Spolverato, Gaya; Votanopoulos, Konstantinos I; Schmidt, Carl; Weber, Sharon M; Bloomston, Mark; Cho, Clifford S; Levine, Edward A; Fields, Ryan C; Pawlik, Timothy M; Maithel, Shishir K; Efron, Bradley; Norton, Jeffrey A; Poultsides, George A.

J Am Coll Surg ; 221(2): 291-9, 2015 Aug.

Artículo en Inglés | MEDLINE | ID: mdl-26206635

RESUMEN

BACKGROUND: Examination of at least 16 lymph nodes (LNs) has been traditionally recommended during gastric adenocarcinoma resection to optimize staging, but the impact of this strategy on survival is uncertain. Because recent randomized trials have demonstrated a therapeutic benefit from extended lymphadenectomy, we sought to investigate the impact of the number of LNs removed on prognosis after gastric adenocarcinoma resection. STUDY DESIGN: We analyzed patients who underwent gastrectomy for gastric adenocarcinoma from 2000 to 2012, at 7 US academic institutions. Patients with M1 disease or R2 resections were excluded. Disease-specific survival (DSS) was calculated using the Kaplan-Meier method and compared using log-rank and Cox regression analyses. RESULTS: Of 742 patients, 257 (35%) had 7 to 15 LNs removed and 485 (65%) had ≥16 LNs removed. Disease-specific survival was not significantly longer after removal of ≥16 vs 7 to 15 LNs (10-year survival, 55% vs 47%, respectively; p = 0.53) for the entire cohort, but was significantly improved in the subset of patients with stage IA to IIIA (10-year survival, 74% vs 57%, respectively; p = 0.018) or N0-2 disease (72% vs 55%, respectively; p = 0.023). Similarly, for patients who were classified to more likely be "true N0-2," based on frequentist analysis incorporating both the number of positive and of total LNs removed, the hazard ratio for disease-related death (adjusted for T stage, R status, grade, receipt of neoadjuvant and adjuvant therapy, and institution) significantly decreased as the number of LNs removed increased. CONCLUSIONS: The number of LNs removed during gastrectomy for adenocarcinoma appears itself to have prognostic implications for long-term survival.

Asunto(s)

Adenocarcinoma/cirugía , Gastrectomía , Escisión del Ganglio Linfático/métodos , Neoplasias Gástricas/cirugía , Adenocarcinoma/mortalidad , Adenocarcinoma/patología , Adulto , Anciano , Femenino , Humanos , Masculino , Persona de Mediana Edad , Estadificación de Neoplasias , Estudios Retrospectivos , Neoplasias Gástricas/mortalidad , Neoplasias Gástricas/patología , Análisis de Supervivencia , Resultado del Tratamiento , Estados Unidos

10.

Optimal Sampling-Based Motion Planning under Differential Constraints: the Drift Case with Linear Affine Dynamics.

Schmerling, Edward; Janson, Lucas; Pavone, Marco.

Proc IEEE Conf Decis Control ; 2015: 2574-2581, 2015 Dec.

Artículo en Inglés | MEDLINE | ID: mdl-26997749

RESUMEN

In this paper we provide a thorough, rigorous theoretical framework to assess optimality guarantees of sampling-based algorithms for drift control systems: systems that, loosely speaking, can not stop instantaneously due to momentum. We exploit this framework to design and analyze a sampling-based algorithm (the Differential Fast Marching Tree algorithm) that is asymptotically optimal, that is, it is guaranteed to converge, as the number of samples increases, to an optimal solution. In addition, our approach allows us to provide concrete bounds on the rate of this convergence. The focus of this paper is on mixed time/control energy cost functions and on linear affine dynamical systems, which encompass a range of models of interest to applications (e.g., double-integrators) and represent a necessary step to design, via successive linearization, sampling-based and provably-correct algorithms for non-linear drift control systems. Our analysis relies on an original perturbation analysis for two-point boundary value problems, which could be of independent interest.

11.

Fast Marching Tree: a Fast Marching Sampling-Based Method for Optimal Motion Planning in Many Dimensions.

Janson, Lucas; Schmerling, Edward; Clark, Ashley; Pavone, Marco.

Int J Rob Res ; 34(7): 883-921, 2015 Jun.

Artículo en Inglés | MEDLINE | ID: mdl-27003958

RESUMEN

In this paper we present a novel probabilistic sampling-based motion planning algorithm called the Fast Marching Tree algorithm (FMT*). The algorithm is specifically aimed at solving complex motion planning problems in high-dimensional configuration spaces. This algorithm is proven to be asymptotically optimal and is shown to converge to an optimal solution faster than its state-of-the-art counterparts, chiefly PRM* and RRT*. The FMT* algorithm performs a "lazy" dynamic programming recursion on a predetermined number of probabilistically-drawn samples to grow a tree of paths, which moves steadily outward in cost-to-arrive space. As such, this algorithm combines features of both single-query algorithms (chiefly RRT) and multiple-query algorithms (chiefly PRM), and is reminiscent of the Fast Marching Method for the solution of Eikonal equations. As a departure from previous analysis approaches that are based on the notion of almost sure convergence, the FMT* algorithm is analyzed under the notion of convergence in probability: the extra mathematical flexibility of this approach allows for convergence rate bounds-the first in the field of optimal sampling-based motion planning. Specifically, for a certain selection of tuning parameters and configuration spaces, we obtain a convergence rate bound of order O(n-1/d+ρ), where n is the number of sampled points, d is the dimension of the configuration space, and ρ is an arbitrarily small constant. We go on to demonstrate asymptotic optimality for a number of variations on FMT*, namely when the configuration space is sampled non-uniformly, when the cost is not arc length, and when connections are made based on the number of nearest neighbors instead of a fixed connection radius. Numerical experiments over a range of dimensions and obstacle configurations confirm our the-oretical and heuristic arguments by showing that FMT*, for a given execution time, returns substantially better solutions than either PRM* or RRT*, especially in high-dimensional configuration spaces and in scenarios where collision-checking is expensive.

12.

An Asymptotically-Optimal Sampling-Based Algorithm for Bi-directional Motion Planning.

Starek, Joseph A; Gomez, Javier V; Schmerling, Edward; Janson, Lucas; Moreno, Luis; Pavone, Marco.

Rep U S ; 2015: 2072-2078, 2015.

Artículo en Inglés | MEDLINE | ID: mdl-27004130

RESUMEN

Bi-directional search is a widely used strategy to increase the success and convergence rates of sampling-based motion planning algorithms. Yet, few results are available that merge both bi-directional search and asymptotic optimality into existing optimal planners, such as PRM*, RRT*, and FMT*. The objective of this paper is to fill this gap. Specifically, this paper presents a bi-directional, sampling-based, asymptotically-optimal algorithm named Bi-directional FMT* (BFMT*) that extends the Fast Marching Tree (FMT*) algorithm to bidirectional search while preserving its key properties, chiefly lazy search and asymptotic optimality through convergence in probability. BFMT* performs a two-source, lazy dynamic programming recursion over a set of randomly-drawn samples, correspondingly generating two search trees: one in cost-to-come space from the initial configuration and another in cost-to-go space from the goal configuration. Numerical experiments illustrate the advantages of BFMT* over its unidirectional counterpart, as well as a number of other state-of-the-art planners.

13.

Effective degrees of freedom: a flawed metaphor.

Janson, Lucas; Fithian, William; Hastie, Trevor J.

Biometrika ; 102(2): 479-485, 2015 Jun 01.

Artículo en Inglés | MEDLINE | ID: mdl-26977114

RESUMEN

To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.

14.

A Methodology for Robust Multiproxy Paleoclimate Reconstructions and Modeling of Temperature Conditional Quantiles.

Janson, Lucas; Rajaratnam, Bala.

J Am Stat Assoc ; 109(505): 63-77, 2014.

Artículo en Inglés | MEDLINE | ID: mdl-25587203

RESUMEN

Great strides have been made in the field of reconstructing past temperatures based on models relating temperature to temperature-sensitive paleoclimate proxies. One of the goals of such reconstructions is to assess if current climate is anomalous in a millennial context. These regression based approaches model the conditional mean of the temperature distribution as a function of paleoclimate proxies (or vice versa). Some of the recent focus in the area has considered methods which help reduce the uncertainty inherent in such statistical paleoclimate reconstructions, with the ultimate goal of improving the confidence that can be attached to such endeavors. A second important scientific focus in the subject area is the area of forward models for proxies, the goal of which is to understand the way paleoclimate proxies are driven by temperature and other environmental variables. One of the primary contributions of this paper is novel statistical methodology for (1) quantile regression with autoregressive residual structure, (2) estimation of corresponding model parameters, (3) development of a rigorous framework for specifying uncertainty estimates of quantities of interest, yielding (4) statistical byproducts that address the two scientific foci discussed above. We show that by using the above statistical methodology we can demonstrably produce a more robust reconstruction than is possible by using conditional-mean-fitting methods. Our reconstruction shares some of the common features of past reconstructions, but we also gain useful insights. More importantly, we are able to demonstrate a significantly smaller uncertainty than that from previous regression methods. In addition, the quantile regression component allows us to model, in a more complete and flexible way than least squares, the conditional distribution of temperature given proxies. This relationship can be used to inform forward models relating how proxies are driven by temperature.

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA